Beginning operation in 2013, the Ford GoBike system currently has over 2,600 bicycles in 262 stations across San Francisco, East Bay and San Jose. Similar to other bike sharing systems in the country, Ford GoBike has suffered from the problem of systematic imbalance. Temporally, bikes tend to pile up downtown in the morning and on the outskirts in the afternoon. Spatially, people are happy to rent a bike and roll on down the hill, but reluctant to go the other way. Therefore, it is common to see that some stations are empty (or full), preventing users from borrowing (or returning) bikes. These imbalances cause inconvenience in using Ford GoBike services.
In order to make Ford Gobike more connected, smart, and efficient, we have designed an App to deal with the imbalance problem. With the new App, users are able to plan their trips by setting a location and a future departure time. The App will show the predicted number of bikes and docks in the time selected. Then, it will provide three most recommended routes according to the availability of bikes and docks at the origin and destination, respectively. The App will also encourage users moving bikes from full stations to empty stations by giving bonus points as coupons.
The main use case of the App, namely predicting hourly bike/dock availability, is based on two Poisson Regression models. Though there are still rooms for improvement, the models are proved to be both accurate and generalizable. Overall, the differences between predicted and observed values are less than 1.7, indicating the number of departure/arrival our model generates will just slightly differ from the true values. Additionally, the models’ performances are quite stable across time and space. We thus firmly believe this methodology can guide the rebalancing process of Ford Gobike, as well as the bike sharing systems in other U.S. cities.
The rest of this material will detail the model building and validation process, and how these models meet the use case we set out to address.
The purposes of our models are to predict hourly departure and arrival of each bikeshare station, respectively. We select 144 Ford GoBike stations, and download bike trip records from 10/14/2018 to 10/20/2018. Below shows the location of the stations involved in this study.
ggmap(map) +
geom_point(data = sf_bike_WGS.xy, aes(x = X, y = Y), color = '#244ead') +
labs(title = "Location of Stations in Study",
subtitle = 'San Francisco, CA') +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.text = element_text(face = "italic"),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=2))
Independent variables include seven categories - internal characteristics, access to amenities, natural environment, social characteristics, land use, time, and lagged variables. Detailed information of the variables we use are shown in the table below.
| Variable | Category | Symbol | Description |
|---|---|---|---|
| Dependent variable | —– | departure | Number of hourly departure |
| arrival | Number of hourly Arrival | ||
| Independent variable | Internal characteristic | eastMost | Dummy variable. Whether the station is located in Financial District, Mission Bay, Mission, or South of Market |
| weekend_hot | Dummy variable. Whether the station is popular at weekends | ||
| Access to ameneties | dist_Park1 | Distance to nearest park | |
| log_dist_ferryBuilding | Distance to San Francisco Ferry Building Station in log format | ||
| log_dist_bus3 | Average distance to nearest 3 bus stops in log format | ||
| log_dist_Townsend | Distance to San Francisco Caltrain (Townsend St at 4th St) Station in log format | ||
| log_dist_crash1 | Distance to the nearest crash in log format | ||
| log_dist_bus1 | Distance to the nearest bus stop in log format | ||
| Natural environment | Condition | Categorical variable. Weather condition. There are 10 levels in total - cloudy, cloudy/windy, fair, fair/windy, fog, haze, mostly cloudy, mostly cloudy/windy, partly cloudy, partly cloudy/windy. | |
| Wind.Speed | Speed of wind | ||
| Social characteristic | log_jobs | Number of jobs in the census block group where the station is located (after log transformation) | |
| pop | Number of population in the census block group where the station is located | ||
| log_dist_crime1 | Distance to the nearest crime in log format | ||
| log_transit | Number of people taking transit to work in the census block group where the station is located (after log transformation) | ||
| houseAge | Median house age in the census block group where the station is located | ||
| log_housingUnits | Number of housing units in the census block group where the station is located (after log transformation) | ||
| housingUnits | Number of housing units in the census block group where the station is located | ||
| log_bike | Number of people taking bicycle to work in the census block group where the station is located (after log transformation) | ||
| Land use | log_pct_VACANT | Percentage of vacant land in the half-mile buffer of bikeshare station (after log transformation) | |
| MIPS | Square footage of office land use in the half-mile buffer of bikeshare station | ||
| pct_VACANT | Percentage of vacant land in the half-mile buffer of bikeshare station | ||
| log_MED | Square footage of medical land use in the half-mile buffer of bikeshare station (after log transformation) | ||
| Time | whichday | Categorical variable. There are 7 levels in total - Mon, Tue, Wed, Thur, Fri, Sat, and Sun | |
| Lag (time) | departure_3h | Number of departure 3 hours ago | |
| departure_24h | Number of departure 24 hours ago | ||
| departure_1w | Number of departure 1 week ago | ||
| arrival_2h | Number of arrival 2 hours ago | ||
| arrival_3h | Number of arrival 3 hours ago | ||
| arrival_1w | Number of arrival 1 week ago | ||
| arrival_1w1h | Number of arrival 1 week and 1 hour ago | ||
| Lag (space-time) | nearest5_departure_2h | Average number of departure 2 hours ago of nearest 5 stations | |
| nearest5_departure_1w | Average number of departure a week ago of nearest 5 stations | ||
| nearest5_arrival_2h | Average number of arrival 2 hours ago of nearest 5 stations | ||
| nearest5_arrival_1w | Average number of arrival 1 week and 1 hour ago of nearest 5 stations | ||
| nearest5_arrival_1w1h | Average number of arrival 1 week and 1 hour age of nearest 5 stations |
All the data comes from the following five sources: 1. San Francisco open data 2. Ford GoBike data 3. ACS 2013 - 2017 5-year estimates 4. Jobs 5. Weather
In this section, we will first look at the distributions of the two dependent variable - the number of departure of each station in each hour, and the number of arrival of each station in each hour. This analysis can inform us which kind of regression we should use.
ggplot(vars_train) +
geom_histogram(aes(x = departure), fill = '#b8d1e3') +
labs(title = "Distribution of Hourly Departure",
subtitle = 'San Francisco, CA',
x = 'Departure',
y = 'Count') +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.title=element_blank())
ggplot(vars_train) +
geom_histogram(aes(x = arrival), fill = '#b8d1e3') +
labs(title = "Distribution of Hourly Arrival",
subtitle = 'San Francisco, CA',
x = 'Arrival',
y = 'Count') +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.title=element_blank())
As shown in the above histograms, the two distributions are similar to one another. There are two key takeaways from these plots: the dependent variables are counts, hence they are discrete instead of continuous; and the dependent variables follow Poisson distribution.
It indicates that Poisson regression is more suitable for the models.
The training set we use records the number of departures/arrivals for each station in every hour from Oct. 14th to Oct. 20th. Here, we calculate the total number of departures/arrivals in every single day. Not surprisingly, the patterns of departure are similar to the ones of arrival, for the total number of departures should equal that of arrivals in each day. In addition, the trends of weekday are considerably different from the ones of weekend. On weekdays, there are typically two peaks when the number of trips are extremely high, while at weekends, the curves look like bumps, and peak hours do not exist. It indicates that this time fixed effect should be identified when building the models.
vars_train %>%
mutate(whichday = factor(whichday, levels = c("Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"))) %>%
group_by(whichday, hour) %>%
summarize(sum_departure = sum(departure)) %>%
mutate(hour = as.numeric(hour)) %>%
ggplot(.) +
geom_line(aes(x = hour, y = sum_departure, color = whichday)) +
scale_x_continuous(breaks = seq(6, 23, by = 1)) +
labs(title = "Training Set Observed Departure Comparison",
subtitle = 'San Francisco, CA',
y = 'Count') +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.title=element_blank())
vars_train %>%
mutate(whichday = factor(whichday, levels = c("Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"))) %>%
group_by(whichday, hour) %>%
summarize(sum_arrival = sum(arrival)) %>%
mutate(hour = as.numeric(hour)) %>%
ggplot(.) +
geom_line(aes(x = hour, y = sum_arrival, color = whichday)) +
scale_x_continuous(breaks = seq(6, 23, by = 1)) +
labs(title = "Training Set Observed Arrival Comparison",
subtitle = 'San Francisco, CA',
y = 'Count') +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.title=element_blank())
The histograms below are each neighborhood’s total number of departures and arrivals in the training set, respectively. The top 4 neighborhoods are identical in both histograms, and the number of trips of them are considerably higher than their counterparts. It indicates that this neighborhood fixed effect should be identified when building models. In fact, as can be noted later, both of our models include an independent variable called “eastMost” - if a certain station is located within these top 4 neighborhoods, they will be coded as 1.
vars_train %>%
group_by(nhood) %>%
summarize(departure = sum(departure)) %>%
ggplot(.) +
geom_histogram(aes(x = reorder(nhood,-departure), y = departure, fill = nhood), stat="identity") +
labs(title="Training Set Observed Departure of each Neighborhood",
subtitle='San Francisco, CA',
x="Neighborhoods",
y="Number of Departure") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
axis.text.x = element_text(angle = 90, hjust = 1),
legend.position='none')
vars_train %>%
group_by(nhood) %>%
summarize(arrival = sum(arrival)) %>%
ggplot(.) +
geom_histogram(aes(x = reorder(nhood,-arrival), y = arrival, fill = nhood), stat="identity") +
labs(title="Training Set Observed Arrival of each Neighborhood",
subtitle='San Francisco, CA',
x="Neighborhoods",
y="Number of Arrival") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
axis.text.x = element_text(angle = 90, hjust = 1),
legend.position='none')
We perform Moran’s I test to check whether the total number of departures/arrivals of each station in the training set is spatially correlated to its nearest 5 bikeshare stations.
stationDeparture <-
vars_train %>%
group_by(station) %>%
summarize(departure = sum(departure),
weekday = first(weekday)) %>%
inner_join(sf_bike, ., by = c('start_station_name' = 'station')) %>%
st_transform(crs = 4326) %>%
mutate(lon = sf_bike_WGS.xy[,1],
lat = sf_bike_WGS.xy[,2])
stationArrival <-
vars_train %>%
group_by(station) %>%
summarize(arrival = sum(arrival),
weekday = first(weekday)) %>%
inner_join(sf_bike, ., by = c('start_station_name' = 'station')) %>%
st_transform(crs = 4326) %>%
mutate(lon = sf_bike_WGS.xy[,1],
lat = sf_bike_WGS.xy[,2])
### Departure - nearest 5
coords0.2 <- cbind(stationDeparture$lon, stationDeparture$lon)
spatialWeights0.2 <- knn2nb(knearneigh(coords0.2, 5))
### Arrival - nearest 5
coords0.4 <- cbind(stationArrival$lon, stationArrival$lon)
spatialWeights0.4 <- knn2nb(knearneigh(coords0.4, 5))
As detailed by the table below, the p-values of two tests are lower than 0.001, meaning the spatial autocorrelation of both departure and arrival are statistically significant at 99.9% confidence level. This exploratory analysis suggests that this spatial lag effect should not be ignored in our models.
data.frame('Departure' = moran.test(stationDeparture$departure, nb2listw(spatialWeights0.2, style="W"))$p.value,
'Arrival' = moran.test(stationArrival$arrival, nb2listw(spatialWeights0.4, style="W"))$p.value,
row.names = 'Nearest 5') %>%
kable(., caption = "P-values of Moran's I Test") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
font_size = 15,
full_width = F) %>%
row_spec(1, background = "#ecf6fe")
| Departure | Arrival | |
|---|---|---|
| Nearest 5 | 0.0000567 | 0.0000043 |
The sample autocorrelation function (ACF) can be used to identify the possible structure of time series data. It gives correlations between the data series and lagged values of the series for lags of 1, 2, 3, and so on. A value higher or lower than the blue dashed line indicates that the correlation is statistically significant.
We randomly pick one bikeshare station to check whether temporal autocorrelation exists in departures and arrivals. Shown in the plots below, both departure and arrival are significantly correlated with several numbers in the past. This temporal lag effect should be included in our prediction models.
autoplot(acf(vars_train %>%
filter(station_id == '100') %>%
dplyr::select(departure),
lag.max = 24,
plot = FALSE)) +
labs(title = 'ACF for Hourly Departure (Bryant St at 15th St Station)',
subtitle = 'Bike Departure Prediction for San Francisco, CA') +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
autoplot(acf(vars_train %>%
filter(station_id == '100') %>%
dplyr::select(arrival),
lag.max = 24,
plot = FALSE)) +
labs(title = 'ACF for Hourly Arrival (Bryant St at 15th St Station)',
subtitle = 'Bike Arrival Prediction for San Francisco, CA') +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
In this section, we calculate the sum of departures/arrivals of each station on weekdays as well as at weekend, and arrange them according to the total number in descending order. What can be noted is that the popular stations are different for weekday and weekend. For instance, the number of total departure of San Francisco Caltrain Station 2 is the highest on weekdays, but it is not among the top 10 stations regarding weekend departures. Based on this finding, we will add a dummy variable to identify whether a specific station is popular at weekends in the model building process.
vars_train %>%
filter(weekday == 'weekday') %>%
group_by(station) %>%
summarize(departure = sum(departure)) %>%
arrange(desc(departure)) %>%
head(10) %>%
kable(., col.names = c('Station', 'Weekday Departure')) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
font_size = 15,
full_width = F) %>%
row_spec(seq(1, 10, 2), background = "#ecf6fe")
| Station | Weekday Departure |
|---|---|
| San Francisco Caltrain Station 2 (Townsend St at 4th St) | 1008 |
| Market St at 10th St | 725 |
| San Francisco Caltrain (Townsend St at 4th St) | 722 |
| San Francisco Ferry Building (Harry Bridges Plaza) | 711 |
| Montgomery St BART Station (Market St at 2nd St) | 670 |
| Powell St BART Station (Market St at 4th St) | 664 |
| Berry St at 4th St | 645 |
| Steuart St at Market St | 612 |
| The Embarcadero at Sansome St | 587 |
| Howard St at Beale St | 540 |
vars_train %>%
filter(weekday == 'weekend') %>%
group_by(station) %>%
summarize(departure = sum(departure)) %>%
arrange(desc(departure)) %>%
head(10) %>%
kable(., col.names = c('Station', 'Weekend Departure')) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
font_size = 15,
full_width = F) %>%
row_spec(seq(1, 10, 2), background = "#ecf6fe")
| Station | Weekend Departure |
|---|---|
| The Embarcadero at Sansome St | 171 |
| San Francisco Ferry Building (Harry Bridges Plaza) | 165 |
| Powell St BART Station (Market St at 4th St) | 151 |
| Market St at 10th St | 143 |
| Central Ave at Fell St | 112 |
| Powell St BART Station (Market St at 5th St) | 101 |
| Mission Dolores Park | 80 |
| Valencia St at 24th St | 79 |
| Union Square (Powell St at Post St) | 78 |
| Valencia St at 16th St | 75 |
Below are two multi-scatterplot visualization that display the relationships between the dependent variable and independent variables. For both models, the degree to which the lagged variables are associated with the dependent variable is the highest.
## Departure
vars_train %>%
dplyr::select(departure, nearest5_departure_1w, departure_1w, arrival_3h,
nearest5_arrival_2h, nearest5_arrival_1w1h, departure_3h,
arrival_1w1h, pop, log_housingUnits, houseAge, log_jobs,
log_pct_VACANT, log_transit, log_dist_Townsend,
log_dist_ferryBuilding, log_dist_crash1, log_dist_bus1,
log_dist_crime1, dist_Park1, Wind.Speed) %>%
gather(variable, value, -departure) %>%
ggplot(.) +
geom_point(aes(x = value, y = departure), color = "#b8d1e3") +
facet_wrap(~variable, scales="free") +
labs(title="Relationship between observed departure and independent variables",
subtitle='San Francisco, CA',
x="Independent Variables",
y="Number of Observed Departure") +
theme_minimal() +
theme(plot.title=element_text(size=16, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
## Arrival
vars_train %>%
dplyr::select(arrival, arrival_1w, nearest5_arrival_1w,
arrival_2h, departure_1w, arrival_3h, nearest5_departure_2h,
nearest5_arrival_1w1h, departure_24h, housingUnits, MIPS,
pct_VACANT, log_MED, log_bike, log_dist_Townsend,
log_dist_crash1, log_dist_bus3, Wind.Speed) %>%
gather(variable, value, -arrival) %>%
ggplot(.) +
geom_point(aes(x = value, y = arrival), color = "#b8d1e3") +
facet_wrap(~variable, scales="free") +
labs(title="Relationship between observed arrival and independent variables",
subtitle='San Francisco, CA',
x="Independent Variables",
y="Number of Observed Arrival") +
theme_minimal() +
theme(plot.title=element_text(size=16, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
Below are the summary statistics of the variables we use in the models.
stargazer(vars_train %>%
dplyr::select('departure', 'arrival', 'dist_Park1', 'log_dist_ferryBuilding',
'log_dist_bus3', 'log_dist_Townsend', 'log_dist_crash1',
'log_dist_bus1', 'Condition', 'Wind.Speed', 'log_jobs', 'pop',
'log_dist_crime1', 'log_transit', 'houseAge', 'log_housingUnits',
'housingUnits', 'log_bike', 'log_pct_VACANT', 'MIPS', 'pct_VACANT',
'log_MED', 'whichday', 'departure_3h', 'departure_24h', 'departure_1w',
'arrival_2h', 'arrival_3h', 'arrival_1w', 'arrival_1w1h',
'nearest5_departure_2h', 'nearest5_departure_1w', 'nearest5_arrival_2h',
'nearest5_arrival_1w', 'nearest5_arrival_1w1h'),
type="text",
title = "Summary Statistics")
##
## Summary Statistics
## ============================================================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ------------------------------------------------------------------------------------------------------------
## departure 18,144 1.781 3.137 0 0 2 57
## arrival 18,144 1.783 3.515 0 0 2 95
## dist_Park1 18,144 1,068.967 720.947 78.879 500.656 1,460.632 3,386.657
## log_dist_ferryBuilding 18,144 9.030 1.031 0.000 8.734 9.582 10.091
## log_dist_bus3 18,144 6.989 0.607 5.125 6.730 7.280 8.582
## log_dist_Townsend 18,144 8.691 1.018 0.000 8.421 9.260 9.759
## log_dist_crash1 18,144 5.244 0.972 2.635 4.588 5.926 6.828
## log_dist_bus1 18,144 6.480 0.831 3.287 6.041 6.987 8.019
## Wind.Speed 18,144 3.405 2.815 0 1 6 9
## log_jobs 18,144 2.706 1.807 0.000 1.386 3.760 6.824
## pop 18,144 2,065.451 2,342.418 0 991 1,942.2 10,264
## log_dist_crime1 18,144 4.039 0.791 0.875 3.596 4.614 6.158
## log_transit 18,144 5.591 1.050 0.000 5.163 6.133 7.845
## houseAge 18,144 52.083 28.150 9 18 79 79
## log_housingUnits 18,144 6.664 0.986 0.000 6.260 7.229 8.568
## housingUnits 18,144 1,157.764 1,216.132 0 522.2 1,378 5,262
## log_bike 18,144 3.128 1.942 0.000 2.247 4.673 5.727
## log_pct_VACANT 18,144 -2.538 0.760 -4.667 -3.020 -1.868 -1.194
## MIPS 18,144 1,004,980.000 1,034,868.000 37,111.230 223,440.100 1,322,705.000 4,079,469.000
## pct_VACANT 18,144 0.101 0.067 0.009 0.049 0.154 0.303
## log_MED 18,144 9.583 2.565 0.000 9.149 10.963 12.798
## departure_3h 18,144 1.713 3.148 0 0 2 57
## departure_24h 18,144 1.784 3.133 0 0 2 57
## departure_1w 18,144 1.779 3.217 0 0 2 58
## arrival_2h 18,144 1.752 3.521 0 0 2 95
## arrival_3h 18,144 1.715 3.527 0 0 2 95
## arrival_1w 18,144 1.781 3.555 0 0 2 83
## arrival_1w1h 18,144 1.774 3.557 0 0 2 83
## nearest5_departure_2h 18,144 1.786 2.335 0 0.4 2.2 29
## nearest5_departure_1w 18,144 1.813 2.347 0 0.4 2.2 27
## nearest5_arrival_2h 18,144 1.776 2.509 0 0.4 2.2 30
## nearest5_arrival_1w 18,144 1.800 2.502 0 0.4 2.2 30
## nearest5_arrival_1w1h 18,144 1.792 2.507 0 0.4 2.2 30
## ------------------------------------------------------------------------------------------------------------
We apply the same methodology, i.e. forward stepwise, when building both models. We start with no variables in the model, test the model’s overall MAE after the addition of each variable, add the statistically significant variable whose inclusion gives the largest amount of MAE decrease, and repeat this process until none improves the model.
Results show that a large majority of preditors are significant.
glm1 <- glm(departure ~ eastMost + nearest5_departure_1w + whichday + departure_1w + Condition +
arrival_3h + Wind.Speed + log_jobs + nearest5_arrival_2h + nearest5_arrival_1w1h +
log_dist_Townsend + log_dist_crash1 + departure_3h + log_dist_bus1 + arrival_1w1h +
log_pct_VACANT + weekend_hot + pop + log_dist_crime1 + log_transit + dist_Park1 +
log_dist_ferryBuilding + houseAge + log_housingUnits,
family = "poisson",
data = vars_train)
glm2 <- glm(arrival ~ eastMost + arrival_1w + whichday + nearest5_arrival_1w + arrival_2h +
housingUnits + log_dist_Townsend + MIPS + weekend_hot + departure_1w +
Condition + Wind.Speed + arrival_3h + nearest5_departure_2h + log_dist_crash1 +
nearest5_arrival_1w1h + pct_VACANT + log_bike + departure_24h + log_dist_bus3 +
log_MED,
family = "poisson",
data = vars_train)
stargazer(glm1, glm2, type="text")
##
## ===========================================================
## Dependent variable:
## ----------------------------
## departure arrival
## (1) (2)
## -----------------------------------------------------------
## eastMostout -0.237*** -0.400***
## (0.025) (0.022)
##
## nearest5_departure_1w 0.085***
## (0.002)
##
## arrival_1w 0.051***
## (0.001)
##
## whichdayMon 0.042** 0.053**
## (0.020) (0.021)
##
## whichdaySat -0.402*** -0.408***
## (0.025) (0.025)
##
## whichdaySun -0.543*** -0.529***
## (0.026) (0.026)
##
## whichdayThur -0.173*** -0.144***
## (0.024) (0.024)
##
## whichdayTue -0.051** -0.067***
## (0.020) (0.020)
##
## whichdayWed 0.022 -0.010
## (0.020) (0.020)
##
## nearest5_arrival_1w 0.061***
## (0.002)
##
## arrival_2h 0.019***
## (0.001)
##
## housingUnits -0.00002***
## (0.00001)
##
## departure_1w 0.057*** 0.019***
## (0.001) (0.001)
##
## ConditionFair -0.324*** -0.271***
## (0.033) (0.033)
##
## ConditionFair / Windy -0.020 -0.261***
## (0.039) (0.040)
##
## ConditionMostly Cloudy -0.257*** -0.271***
## (0.039) (0.039)
##
## ConditionPartly Cloudy -0.299*** -0.297***
## (0.045) (0.045)
##
## ConditionPartly Cloudy / Windy -0.168*** -0.276***
## (0.038) (0.039)
##
## arrival_3h 0.019*** 0.012***
## (0.001) (0.001)
##
## nearest5_departure_2h -0.024***
## (0.003)
##
## Wind.Speed -0.013*** -0.016***
## (0.002) (0.002)
##
## log_jobs 0.046***
## (0.005)
##
## nearest5_arrival_2h -0.039***
## (0.003)
##
## nearest5_arrival_1w1h 0.020*** 0.021***
## (0.003) (0.002)
##
## pct_VACANT 1.149***
## (0.157)
##
## log_bike 0.009**
## (0.004)
##
## departure_24h 0.008***
## (0.001)
##
## log_dist_bus3 -0.054***
## (0.012)
##
## log_MED 0.012***
## (0.003)
##
## log_dist_Townsend 0.051*** 0.096***
## (0.006) (0.006)
##
## MIPS 0.00000***
## (0.000)
##
## log_dist_crash1 -0.061*** -0.061***
## (0.007) (0.006)
##
## departure_3h 0.021***
## (0.002)
##
## log_dist_bus1 -0.036***
## (0.008)
##
## arrival_1w1h 0.016***
## (0.001)
##
## log_pct_VACANT 0.064***
## (0.013)
##
## weekend_hot 0.729*** 0.783***
## (0.040) (0.039)
##
## pop -0.00003***
## (0.00000)
##
## log_dist_crime1 0.031***
## (0.009)
##
## log_transit 0.037***
## (0.014)
##
## dist_Park1 -0.00004***
## (0.00001)
##
## log_dist_ferryBuilding 0.015***
## (0.005)
##
## houseAge -0.001***
## (0.0004)
##
## log_housingUnits 0.038***
## (0.014)
##
## Constant 0.201 0.242**
## (0.125) (0.114)
##
## -----------------------------------------------------------
## Observations 18,144 18,144
## Log Likelihood -31,188.740 -31,445.740
## Akaike Inf. Crit. 62,445.480 62,953.480
## ===========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The histograms below show the distribution of residuals of departure model and arrival model, respectively. Both models tend to underpredict the dependent variable, as a large number of residuals are around minus one. But overall there does not exist very extreme figures, so these two models can be considered accurate.
glm1Attributes <-
data.frame(station_id = vars_train$station_id,
station = vars_train$station,
observedDeparture = vars_train$departure,
predictedDeparture = glm1$fitted.values,
residuals = glm1$residuals)
ggplot(glm1Attributes) +
geom_histogram(aes(x = residuals), fill = '#b8d1e3') +
scale_x_continuous(breaks = seq(min(glm1Attributes$residuals)-1, max(glm1Attributes$residuals)+1, 2)) +
labs(title = 'Distribution of Residuals',
subtitle = 'Bike Departure Prediction for San Francisco, CA',
x="Residuals",
y="Count") +
theme_minimal() +
theme(plot.title=element_text(size=18, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
glm2Attributes <-
data.frame(station_id = vars_train$station_id,
station = vars_train$station,
observedArrival = vars_train$arrival,
predictedArrival = glm2$fitted.values,
residuals = glm2$residuals)
ggplot(glm2Attributes) +
geom_histogram(aes(x = residuals), fill = '#b8d1e3') +
scale_x_continuous(breaks = seq(min(glm2Attributes$residuals)-1, max(glm2Attributes$residuals)+1, 2)) +
labs(title = 'Distribution of Residuals',
subtitle = 'Bike Arrival Prediction for San Francisco, CA',
x="Residuals",
y="Count") +
theme_minimal() +
theme(plot.title=element_text(size=18, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
To evaluate whether our models are able to perform similarly well in the datasets they have never seen, we select the bike trips on Oct. 21st, which is a Sunday, and Oct. 22nd, which is a Monday, as our test sets. The metric used here is the Mean Absolute Error (MAE), which measures the average difference of predicted and observed departures/arrivals in absolute values. For instance, if the departure MAE equals 2 in a particular day, it means on average, the predicted departure differs from observed departure by an amount of 2. There is another metric called Mean Absolute Percentage Error (MAPE), which is used quite often as well. This analysis chooses MAE for the following reasons: 1. Many observed values equal 0, leading to a MAPE of infinity; 2. The observed value being 1 and the predicted value being 2 will result in a MAPE of 100%, while the observed value being 10 and the predicted value being 20 will result in the exact same MAPE. Nonetheless, for count variables (in this case, departure and arrival), the impact of the former is much more marginal.
glm1.1PredValues <-
data.frame(station_name = vars_test1$station,
neighborhood = vars_test1$nhood,
station_id = as.numeric(as.character(vars_test1$station_id)),
hour = as.numeric(as.character(vars_test1$hour)),
geometry = vars_test1$geometry,
observedDeparture = vars_test1$departure,
predictedDeparture = predict(glm1, vars_test1)) %>%
mutate(residual = observedDeparture - predictedDeparture) %>%
mutate(absError = abs(predictedDeparture - observedDeparture)) %>%
na.omit()
glm1.2PredValues <-
data.frame(station_name = vars_test2$station,
neighborhood = vars_test1$nhood,
station_id = as.numeric(as.character(vars_test2$station_id)),
hour = as.numeric(as.character(vars_test2$hour)),
geometry = vars_test2$geometry,
observedDeparture = vars_test2$departure,
predictedDeparture = predict(glm1, vars_test2)) %>%
mutate(residual = observedDeparture - predictedDeparture) %>%
mutate(absError = abs(predictedDeparture - observedDeparture)) %>%
na.omit()
glm2.1PredValues <-
data.frame(station_name = vars_test1$station,
neighborhood = vars_test1$nhood,
station_id = as.numeric(as.character(vars_test1$station_id)),
hour = as.numeric(as.character(vars_test1$hour)),
geometry = vars_test1$geometry,
observedArrival = vars_test1$arrival,
predictedArrival = predict(glm2, vars_test1)) %>%
mutate(residual = observedArrival - predictedArrival) %>%
mutate(absError = abs(predictedArrival - observedArrival)) %>%
na.omit()
glm2.2PredValues <-
data.frame(station_name = vars_test2$station,
neighborhood = vars_test1$nhood,
station_id = as.numeric(as.character(vars_test2$station_id)),
hour = as.numeric(as.character(vars_test2$hour)),
geometry = vars_test2$geometry,
observedArrival = vars_test2$departure,
predictedArrival = predict(glm2, vars_test2)) %>%
mutate(residual = observedArrival - predictedArrival) %>%
mutate(absError = abs(predictedArrival - observedArrival)) %>%
na.omit()
Illustrated by the table below, the performance of both models are similar. Admittedly, the MAEs are lower in the Oct. 21st testset, but it does not mean that our models are not generalizable. Typically, there are more trips on weekdays than at weekends, and it is indeed reasonable that times with larger number of trips come with higher MAEs.
data.frame('Weekday' = c(mean(glm1.1PredValues$absError), mean(glm2.1PredValues$absError)),
'Weekend' = c(mean(glm1.2PredValues$absError), mean(glm2.2PredValues$absError)),
row.names = c('Departure', 'Arrival')) %>%
kable(., caption = 'Comparison of MAE') %>%
kable_styling(bootstrap_options = c("striped", "hover"),
font_size = 15,
full_width = F) %>%
row_spec(seq(1, 2, 2), background = "#ecf6fe")
| Weekday | Weekend | |
|---|---|---|
| Departure | 1.639550 | 1.047264 |
| Arrival | 1.678003 | 1.079387 |
Cross-validation allows us to see how generalizable our model is to a number of random samples, instead of two samples in the previous section. Here, we used an algorithm called “100-fold cross-validation”. This methodology allows us to first partition the entire dataset into 100 equally sized subsets, hold out one of those subsets as the test set, train the model using the remaining 99 subsets, predict for the hold out subset, and record a goodness of fit metric. The average of the goodness of fit metrics across all 100 folds will also be generated.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
glm1Fit <- train(departure ~ eastMost + nearest5_departure_1w + whichday + departure_1w + Condition + arrival_3h +
Wind.Speed + log_jobs + nearest5_arrival_2h + nearest5_arrival_1w1h + log_dist_Townsend +
log_dist_crash1 + log_dist_bike3 + arrival_1w1h + departure_3h + log_dist_bus1 + log_pct_VACANT +
dist_School5 + arrival_24h + nearest3_arrival_24h + log_dist_crime1 + log_dist_ferryBuilding +
dist_Park1 + log_dist_retail3 + log_dist_school1,
data = vars_train,
family = "poisson",
method = "glm",
trControl = fitControl)
glm2Fit <- train(arrival ~ eastMost + arrival_1w + whichday + nearest5_arrival_1w + arrival_2h +
housingUnits + log_dist_Townsend + MIPS + weekend_hot + departure_1w +
Condition + Wind.Speed + arrival_3h + nearest5_departure_2h + log_dist_crash1 +
nearest5_arrival_1w1h + pct_VACANT + log_bike + departure_24h + log_dist_bus3 +
log_MED,
data = vars_train,
family = "poisson",
method = "glm",
trControl = fitControl)
The mean of MAE for the departure model equals 1.39, indicating the average difference between predicted and observed departures is 1.39. The MAE of 1.43 means the average difference between predicted and observed arrivals is equal to 1.43.
glm1Fit; glm2Fit
## Generalized Linear Model
##
## 18144 samples
## 25 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 17962, 17963, 17962, 17962, 17963, 17962, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.423578 0.5257128 1.393558
## Generalized Linear Model
##
## 18144 samples
## 21 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 17963, 17963, 17961, 17963, 17963, 17963, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.86953 0.5295127 1.432169
The two histograms below show the distribution of MAE for the two models, respectively. For the departure model, the condensed distribution indicates that the model performance is stable across datasets. When it comes to the arrival model, what might worth noticing is that there are two samples in which the MAEs are above 2.5. The arrival model does not perform very well in these subsets. Beyond that, the other MAE values are distributed around 1.4-1.5.
ggplot(as.data.frame(glm1Fit$resample), aes(MAE)) +
geom_histogram(bins=15, fill='#b8d1e3') +
labs(title = '100-Fold Cross-Validation MAE',
subtitle = 'Bike Departure Prediction for San Francisco, CA',
x="Mean Absolute Error",
y="Count") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
ggplot(as.data.frame(glm2Fit$resample), aes(MAE)) +
geom_histogram(bins=15, fill='#b8d1e3') +
labs(title = '100-Fold Cross-Validation MAE',
subtitle = 'Bike Arrival Prediction for San Francisco, CA',
x="Mean Absolute Error",
y="Count") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
After examining the models’ generalizability across datasets, we next decide to assess whether they are generalizable across space. In this section, we will take a look at two geographies - stations and neighborhoods.
To calculate MAE by each station, for each of the two test sets, we take the mean of absolute error for every station in the given days.
As shown by the following maps, departure and arrival models display similar patterns. On weekdays, stations with relatively higher MAEs are located in the northeast of San Francisco, while at weekends, it seems that those with higher MAES are concentrated in the southwest instead.
station_weekdayDepartureMAE <-
glm1.1PredValues %>%
group_by(station_name) %>%
summarize(MAE = mean(absError),
observed = sum(observedDeparture)) %>%
inner_join(sf_bike, ., by = c('start_station_name' = 'station_name')) %>%
st_transform(crs = 4326) %>%
mutate(lon = sf_bike_WGS.xy[,1],
lat = sf_bike_WGS.xy[,2])
ggmap(map) +
geom_point(data=station_weekdayDepartureMAE,
aes(x=lon, y=lat,
colour=factor(ntile(MAE,5)), size=factor(ntile(observed,5)))) +
scale_colour_manual(values = c("#d6d6ff","#8f97e3","#556cc9","#244ead","#003994"),
labels=as.character(round(quantile(station_weekdayDepartureMAE$MAE,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="MAE \n(Quintile Breaks)") +
scale_size_manual(values = c(1, 2, 3, 4, 5),
labels=as.character(round(quantile(station_weekdayDepartureMAE$observed,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="Observed Departure \n(Quintile Breaks)") +
labs(title = "Departure MAE by Bikeshare Station",
subtitle = 'Weekday Testset') +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.text = element_text(face = "italic"),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=2))
station_weekendDepartureMAE <-
glm1.2PredValues %>%
group_by(station_name) %>%
summarize(MAE = mean(absError),
observed = sum(observedDeparture)) %>%
inner_join(sf_bike, ., by = c('start_station_name' = 'station_name')) %>%
st_transform(crs = 4326) %>%
mutate(lon = sf_bike_WGS.xy[,1],
lat = sf_bike_WGS.xy[,2])
ggmap(map) +
geom_point(data=station_weekendDepartureMAE,
aes(x=lon, y=lat,
colour=factor(ntile(MAE,5)), size=factor(ntile(observed,5)))) +
scale_colour_manual(values = c("#d6d6ff","#8f97e3","#556cc9","#244ead","#003994"),
labels=as.character(round(quantile(station_weekendDepartureMAE$MAE,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="MAE \n(Quintile Breaks)") +
scale_size_manual(values = c(1, 2, 3, 4, 5),
labels=as.character(round(quantile(station_weekendDepartureMAE$observed,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="Observed Departure \n(Quintile Breaks)") +
labs(title = "Departure MAE by Bikeshare Station",
subtitle = 'Weekend Testset') +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.text = element_text(face = "italic"),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=2))
station_weekdayArrivalMAE <-
glm2.1PredValues %>%
group_by(station_name) %>%
summarize(MAE = mean(absError),
observed = sum(observedArrival)) %>%
inner_join(sf_bike, ., by = c('start_station_name' = 'station_name')) %>%
st_transform(crs = 4326) %>%
mutate(lon = sf_bike_WGS.xy[,1],
lat = sf_bike_WGS.xy[,2])
ggmap(map) +
geom_point(data=station_weekdayArrivalMAE,
aes(x=lon, y=lat,
colour=factor(ntile(MAE,5)), size=factor(ntile(observed,5)))) +
scale_colour_manual(values = c("#d6d6ff","#8f97e3","#556cc9","#244ead","#003994"),
labels=as.character(round(quantile(station_weekdayArrivalMAE$MAE,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="MAE \n(Quintile Breaks)") +
scale_size_manual(values = c(1, 2, 3, 4, 5),
labels=as.character(round(quantile(station_weekdayArrivalMAE$observed,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="Observed Arrival \n(Quintile Breaks)") +
labs(title = "Arrival MAE by Bikeshare Station",
subtitle = 'Weekday Testset') +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.text = element_text(face = "italic"),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=2))
station_weekendArrivalMAE <-
glm2.2PredValues %>%
group_by(station_name) %>%
summarize(MAE = mean(absError),
observed = sum(observedArrival)) %>%
inner_join(sf_bike, ., by = c('start_station_name' = 'station_name')) %>%
st_transform(crs = 4326) %>%
mutate(lon = sf_bike_WGS.xy[,1],
lat = sf_bike_WGS.xy[,2])
ggmap(map) +
geom_point(data=station_weekendArrivalMAE,
aes(x=lon, y=lat,
colour=factor(ntile(MAE,5)), size=factor(ntile(observed,5)))) +
scale_colour_manual(values = c("#d6d6ff","#8f97e3","#556cc9","#244ead","#003994"),
labels=as.character(round(quantile(station_weekendArrivalMAE$MAE,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="MAE \n(Quintile Breaks)") +
scale_size_manual(values = c(1, 2, 3, 4, 5),
labels=as.character(round(quantile(station_weekendArrivalMAE$observed,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="Observed Arrival \n(Quintile Breaks)") +
labs(title = "Arrival MAE by Bikeshare Station",
subtitle = 'Weekend Testset') +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.text = element_text(face = "italic"),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=2))
However, this does not mean that our models are not generalizable across stations. We can notice from above maps that stations with higher MAEs tend to have higher number of trips. To explore further, we are going to see whether there is a relationship between the number of trips and MAE. The four scatter plots below basically tell the same story. The Mean Absolute Error is positively correlated with the observed departures/arrivals. That is to say, it is true that stations with higher number of trips normally are also those with larger MAEs. As previously discussed, this phenomenon is indeed reasonable, and does not prove that our models fail to generalize across space.
ggplot(data = glm1.1PredValues %>%
group_by(station_name) %>%
summarize(departure = sum(observedDeparture),
MAE = mean(absError))) +
geom_point(aes(x = departure, y = MAE)) +
stat_smooth(aes(x=departure, y=MAE), method = "lm", se = FALSE, size = 0.7, colour="#2f597a") +
labs(title = 'Departure MAE as function of Observed Departure',
subtitle = 'Weekday Testset ',
x="Observed Departure",
y="MAE") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
ggplot(data = glm1.2PredValues %>%
group_by(station_name) %>%
summarize(departure = sum(observedDeparture),
MAE = mean(absError))) +
geom_point(aes(x = departure, y = MAE)) +
stat_smooth(aes(x=departure, y=MAE), method = "lm", se = FALSE, size = 0.7, colour="#2f597a") +
labs(title = 'Departure MAE as function of Observed Departure',
subtitle = 'Weekend Testset',
x="Observed Departure",
y="MAE") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
ggplot(data = glm2.1PredValues %>%
group_by(station_name) %>%
summarize(Arrival = sum(observedArrival),
MAE = mean(absError))) +
geom_point(aes(x = Arrival, y = MAE)) +
stat_smooth(aes(x = Arrival, y = MAE), method = "lm", se = FALSE, size = 0.7, colour="#2f597a") +
labs(title = 'Arrival MAE as function of Observed Arrival',
subtitle = 'Weekday Testset',
x="Observed Arrival",
y="MAE") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
ggplot(data = glm2.2PredValues %>%
group_by(station_name) %>%
summarize(Arrival = sum(observedArrival),
MAE = mean(absError))) +
geom_point(aes(x = Arrival, y = MAE)) +
stat_smooth(aes(x = Arrival, y = MAE), method = "lm", se = FALSE, size = 0.7, colour="#2f597a") +
labs(title = 'Arrival MAE as function of Observed Arrival',
subtitle = 'Weekend Testset',
x="Observed Arrival",
y="MAE") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
To calculate MAE by each neighborhood, for each of the two test sets, we take the mean of absolute error for every neighborhood in the given days.
The evaluation approach we apply here is similar to the one used towards stations. The only difference is that we measure the mean of departure/arrival rather than the sum of departure/arrival as in the previous section. The mean of departure/arrival here means the average number of observed values per station per hour. The reason why we use this metric is that the number of bikeshare stations in each neighborhood varies considerably from one another, probably resulting in a bias in the total number of departure/arrival in neighborhoods.
The four scatter plots below suggest that our models’ performances are consistent across neighborhoods whichever test set we use. It indicates both models are generalizable not only across stations but also across neighborhoods.
glm1.1PredValues %>%
group_by(neighborhood) %>%
summarize(MAE = mean(absError),
Observed = mean(observedDeparture)) %>%
ggplot() +
geom_point(aes(x = Observed, y = MAE)) +
stat_smooth(aes(x = Observed, y = MAE), method = 'lm', se = FALSE, size = 0.7, colour="#2f597a") +
labs(title = 'Departure MAE as function of Observed Departure \n(Neighborhood Level)',
subtitle = 'Weekday Testset ',
x="Average Observed Departure",
y="MAE") +
theme_minimal() +
theme(plot.title=element_text(size=18, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
glm1.2PredValues %>%
group_by(neighborhood) %>%
summarize(MAE = mean(absError),
Observed = mean(observedDeparture)) %>%
ggplot() +
geom_point(aes(x = Observed, y = MAE)) +
stat_smooth(aes(x = Observed, y = MAE), method = 'lm', se = FALSE, size = 0.7, colour="#2f597a") +
labs(title = 'Departure MAE as function of Observed Departure \n(Neighborhood Level)',
subtitle = 'Weekend Testset',
x="Average Observed Departure",
y="MAE") +
theme_minimal() +
theme(plot.title=element_text(size=18, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
glm2.1PredValues %>%
group_by(neighborhood) %>%
summarize(MAE = mean(absError),
Observed = mean(observedArrival)) %>%
ggplot() +
geom_point(aes(x = Observed, y = MAE)) +
stat_smooth(aes(x = Observed, y = MAE), method = 'lm', se = FALSE, size = 0.7, colour="#2f597a") +
labs(title = 'Arrival MAE as function of Observed Arrival \n(Neighborhood Level)',
subtitle = 'Weekday Testset',
x="Average Observed Arrival",
y="MAE") +
theme_minimal() +
theme(plot.title=element_text(size=18, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
glm2.2PredValues %>%
group_by(neighborhood) %>%
summarize(MAE = mean(absError),
Observed = mean(observedArrival)) %>%
ggplot() +
geom_point(aes(x = Observed, y = MAE)) +
stat_smooth(aes(x = Observed, y = MAE), method = 'lm', se = FALSE, size = 0.7, colour="#2f597a") +
labs(title = 'Arrival MAE as function of Observed Arrival \n(Neighborhood Level)',
subtitle = 'Weekend Testset',
x="Average Observed Arrival",
y="MAE") +
theme_minimal() +
theme(plot.title=element_text(size=18, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
Both departure and arrival models are generalizable across time. The facetted line plots illustrate the trends displayed by MAE is almost identical to the ones of according observed departure/arrival. The performances of our models are consistent across the day, since the prediction errors are relatively larger when the observed values are high, and they are relatively smaller when the observed values are low.
ggplot(glm1.1PredValues %>%
group_by(hour) %>%
summarize(MAE = mean(absError),
Departure = sum(observedDeparture)) %>%
gather(Variable, Value, MAE:Departure)) +
geom_point(aes(x = hour, y = Value), color = '#2f597a') +
geom_line(aes(x = hour, y = Value), color = '#2f597a') +
facet_wrap(~Variable, scales = "free") +
scale_x_continuous(breaks = seq(6, 23, 1)) +
labs(title = 'Observed Departure vs. MAE',
subtitle = 'Weekday Testset',
x="Hour",
y="Value") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
ggplot(glm1.2PredValues %>%
group_by(hour) %>%
summarize(MAE = mean(absError),
Departure = sum(observedDeparture)) %>%
gather(Variable, Value, MAE:Departure)) +
geom_point(aes(x = hour, y = Value), color = '#2f597a') +
geom_line(aes(x = hour, y = Value), color = '#2f597a') +
facet_wrap(~Variable, scales = "free") +
scale_x_continuous(breaks = seq(6, 23, 1)) +
labs(title = 'Observed Departure vs. MAE',
subtitle = 'Weekend Testset',
x="Hour",
y="Value") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
ggplot(glm2.1PredValues %>%
group_by(hour) %>%
summarize(MAE = mean(absError),
Arrival = sum(observedArrival)) %>%
gather(Variable, Value, MAE:Arrival)) +
geom_point(aes(x = hour, y = Value), color = '#2f597a') +
geom_line(aes(x = hour, y = Value), color = '#2f597a') +
facet_wrap(~Variable, scales = "free") +
scale_x_continuous(breaks = seq(6, 23, 1)) +
labs(title = 'Observed Arrival vs. MAE',
subtitle = 'Weekday Testset',
x="Hour",
y="Value") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
ggplot(glm2.2PredValues %>%
group_by(hour) %>%
summarize(MAE = mean(absError),
Arrival = sum(observedArrival)) %>%
gather(Variable, Value, MAE:Arrival)) +
geom_point(aes(x = hour, y = Value), color = '#2f597a') +
geom_line(aes(x = hour, y = Value), color = '#2f597a') +
facet_wrap(~Variable, scales = "free") +
scale_x_continuous(breaks = seq(6, 23, 1)) +
labs(title = 'Observed Arrival vs. MAE',
subtitle = 'Weekend Testset',
x="Hour",
y="Value") +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
To check whether there exists spatial autocorrelation in the residuals, we first randomly select an hour in both weekday and weekend testsets, and perform Moran’s I tests. Illustrated by the table below, the p-values of the four tests are higher than 0.05, indicating that spatial autocorrelation of the residuals is not significant.
## Weekday Departure
set.seed(49)
sample_hour <- sample(6:23, 1)
glm1.1Attributes <-
data.frame(residual = glm1.1PredValues %>%
filter(hour == sample_hour) %>%
select(residual),
Longitude = sf_bike_WGS.xy[, 1],
Latitude = sf_bike_WGS.xy[, 2])
coords1.1 <- cbind(glm1.1Attributes$Longitude, glm1.1Attributes$Latitude)
spatialWeights1.1 <- knn2nb(knearneigh(coords1.1, 4))
## Weekend Departure
glm1.2Attributes <-
data.frame(residual = glm1.2PredValues %>%
filter(hour == sample_hour) %>%
dplyr::select(residual),
Longitude = sf_bike_WGS.xy[, 1],
Latitude = sf_bike_WGS.xy[, 2])
coords1.2 <- cbind(glm1.2Attributes$Longitude, glm1.2Attributes$Latitude)
spatialWeights1.2 <- knn2nb(knearneigh(coords1.2, 4))
## Weekday Arrival
glm2.1Attributes <-
data.frame(residual = glm2.1PredValues %>%
filter(hour == sample_hour) %>%
dplyr::select(residual),
Longitude = sf_bike_WGS.xy[, 1],
Latitude = sf_bike_WGS.xy[, 2])
coords2.1 <- cbind(glm2.1Attributes$Longitude, glm2.1Attributes$Latitude)
spatialWeights2.1 <- knn2nb(knearneigh(coords2.1, 4))
## Weekend Arrival
glm2.2Attributes <-
data.frame(residual = glm2.2PredValues %>%
filter(hour == sample_hour) %>%
dplyr::select(residual),
Longitude = sf_bike_WGS.xy[, 1],
Latitude = sf_bike_WGS.xy[, 2])
coords2.2 <- cbind(glm2.2Attributes$Longitude, glm2.2Attributes$Latitude)
spatialWeights2.2 <- knn2nb(knearneigh(coords2.2, 4))
data.frame('Departure' = c(moran.test(glm1.1Attributes$residual, nb2listw(spatialWeights1.1, style="W"))$p.value,
moran.test(glm1.2Attributes$residual, nb2listw(spatialWeights1.1, style="W"))$p.value),
'Arrival' = c(moran.test(glm2.1Attributes$residual, nb2listw(spatialWeights1.1, style="W"))$p.value,
moran.test(glm2.2Attributes$residual, nb2listw(spatialWeights1.1, style="W"))$p.value),
row.names = c('Weekday', 'Weekend')) %>%
kable(., caption = "P-values of Moran's I Test") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
font_size = 15,
full_width = F) %>%
row_spec(seq(1, 2, 2), background = "#ecf6fe")
| Departure | Arrival | |
|---|---|---|
| Weekday | 0.1858295 | 0.7841735 |
| Weekend | 0.7351573 | 0.7004535 |
We also make four maps in order to examine spatial autocorrelation directly.
ggmap(map) +
geom_point(data=glm1.1Attributes,
aes(x=Longitude, y=Latitude, colour=factor(ntile(residual,5))),
size=1.9) +
scale_colour_manual(values = c("#d6d6ff","#8f97e3","#556cc9","#244ead","#003994"),
labels=as.character(round(quantile(glm1.1Attributes$residual,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="Residual \n(Quintile Breaks)") +
labs(title = "Departure Residual by Bikeshare Station",
subtitle = 'Weekday Testset; Randomly Selected Hour') +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.text = element_text(face = "italic"),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=2))
ggmap(map) +
geom_point(data=glm1.2Attributes,
aes(x=Longitude, y=Latitude, colour=factor(ntile(residual,5))),
size=1.9) +
scale_colour_manual(values = c("#d6d6ff","#8f97e3","#556cc9","#244ead","#003994"),
labels=as.character(round(quantile(glm1.2Attributes$residual,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="Residual \n(Quintile Breaks)") +
labs(title = "Departure Residual by Bikeshare Station",
subtitle = 'Weekend Testset; Randomly Selected Hour') +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.text = element_text(face = "italic"),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=2))
ggmap(map) +
geom_point(data=glm2.1Attributes,
aes(x=Longitude, y=Latitude, colour=factor(ntile(residual,5))),
size=1.9) +
scale_colour_manual(values = c("#d6d6ff","#8f97e3","#556cc9","#244ead","#003994"),
labels=as.character(round(quantile(glm2.1Attributes$residual,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="Residual \n(Quintile Breaks)") +
labs(title = "Arrival Residual by Bikeshare Station",
subtitle = 'Weekday Testset; Randomly Selected Hour') +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.text = element_text(face = "italic"),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=2))
ggmap(map) +
geom_point(data=glm2.2Attributes,
aes(x=Longitude, y=Latitude, colour=factor(ntile(residual,5))),
size=1.9) +
scale_colour_manual(values = c("#d6d6ff","#8f97e3","#556cc9","#244ead","#003994"),
labels=as.character(round(quantile(glm2.2Attributes$residual,
c(.1,.2,.4,.6,.8),na.rm=T),3)),
name="Residual \n(Quintile Breaks)") +
labs(title = "Arrival Residual by Bikeshare Station",
subtitle = 'Weekend Testset; Randomly Selected Hour') +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.text = element_text(face = "italic"),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=2))
Similarly, to see whether there exists temporal autocorrelation in the residuals, we first randomly select a station in the testsets, and then use the ACF to identify the correlations within time series. The graphs below do not suggest any strong correlation among the residuals across time. In other words, the temporal patterns in dependent variables are mostly explained by our models.
## Weekday Departure
set.seed(49)
sample_station <- sample(unique(vars_train$station_id), 1)
autoplot(acf(glm1.1PredValues %>%
filter(station_id == sample_station) %>%
dplyr::select(residual),
plot = FALSE)) +
labs(title = 'ACF for Departure Residuals',
subtitle = 'Weekday Testset; Randomly Selected Station') +
theme_minimal() +
theme(plot.title=element_text(size=18, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
## Weekend Departure
autoplot(acf(glm1.2PredValues %>%
filter(station_id == sample_station) %>%
dplyr::select(residual),
plot = FALSE)) +
labs(title = 'ACF for Departure Residuals',
subtitle = 'Weekend Testset; Randomly Selected Station') +
theme_minimal() +
theme(plot.title=element_text(size=18, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
## Weekday Arrival
autoplot(acf(glm2.1PredValues %>%
filter(station_id == sample_station) %>%
dplyr::select(residual),
plot = FALSE)) +
labs(title = 'ACF for Arrival Residuals',
subtitle = 'Weekday Testset; Randomly Selected Station') +
theme_minimal() +
theme(plot.title=element_text(size=18, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
## Weekend Arrival
autoplot(acf(glm2.2PredValues %>%
filter(station_id == sample_station) %>%
dplyr::select(residual),
plot = FALSE)) +
labs(title = 'ACF for Arrival Residuals',
subtitle = 'Weekend Testset; Randomly Selected Station') +
theme_minimal() +
theme(plot.title=element_text(size=18, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
Regarding accuracy, the models developed here do a very good job. Most residuals range from -1 to 0, meaning that the predictions will just slightly differ from the observed values. When it comes to generalizability, we evaluate the models from three aspects - across datasets, space, and time. Overall, the prediction error is positively associated with the observed value. We consider it is reasonable that the geography and time with larger number of departures/arrivals come with higher MAEs. In this sense, our models are generalizable. Unfortunately however, MAPE, another metric used quite often, does not work in this case, since any unit with an observed departure/arrival of 0 will result in a MAPE of infinity.
We also measure whether there are spatial and temporal autocorrelation in the residuals by randomly selecting an hour and a station, respectively. What we find is that both spatial and temporal autocorrelation are not significant.
The existing Ford GoBike App can tell you the current number of bikes and docks of each station, but fails to tell if there will be available bikes or docks when I actually arrive at the station, and how to plan my trip efficiently.
The new App we design is to make the system more connected, smart, and efficient. Firstly, you can plan the trip by setting a location and departure time. The App will show the predicted number of bikes and docks in the time you choose. Then, it will provide three most recommended routes, and encourage you moving bikes from full station to empty station by giving you bonus points as coupons.
For example, if I know there will be no available bikes in Station A in one hour, I would change my plan and go to another station. This is the rebalancing effect from the demand side. In addition, by giving a bonus to users and encourage them helping move bikes, the app also helps rebalance the system from the supply side.
When entering the app, the user can first choose the origin from the locations used most often, or choose the others by clicking the “Where to” bar. Then the user can choose when to depart. It can be the current time, or a future time today. For example, the user can select 1 p.m. from the drop down menu if he/she wants to leave then.
After setting the location, the app will automatically give three suggested routes, which are based on two criteria: First, it should guarantee there are available bikes when the user arrives at the departure station, and available docks when reaching the destination station. Secondly, it will give bonus points to the routes that go from a bike-sufficient station to a dock-sufficient station. If the user chooses the first route, the app will show the detail of the trip.
If the user is unsatisfied with all three recommended routes, he/she can go back and customize the OD stations. By clicking a specific station, the App will show its detailed information, including current available bikes, predicted arrival, and departure, based on which the number of available bikes when arriving is given. The graph at the bottom shows the hourly available bikes. The predicted numbers in the future are shown as dash line. In addition, it will show how many bonus points will be given if the user unlock a bike here. For Bryant and 6th St station, since it has few available bikes, the user will get zero point if picking it as the origin. Victoria Manalo Draves Park station has lots of predicted available bikes. Hence the user will get 15 points to set it as the origin, because he/she helps rebalance the system.
Similarly, the user will know the stations around the destination, and also the relevant information. The bonus I will get depends on the dock availability when arriving.
After setting the OD stations, the app will give the detailed routes.
As previously mentioned, the patterns of weekday and weekend are totally different. On weekdays, typically there are two peaks when the number of departure and arrival are extremely high. Comparatively, the curves look like a bump, and peak hours do not exist at weekend. When building the regression model, however, we put both weekday and weekend into the same formula. The differences of weekday and weekend would be better identified if we can separate these two.
Admittedly, we did try to model the effect of time on departure/arrival during feature engineering. Nonetheless, what we may ignore is that every station has a different pattern of generating and absorbing trips across time. The following plot illustrates the comparison of observed departure of four randomly selected stations from the training set on Oct. 15th. Apparently, the trends of them do not match, as the Embarcadero at Steuart St Station has three peaks during the day, while 30th St at San Jose Ave Station has zero. It would be beneficial to add a variable that is able to combine the effect of station and time. Or perhaps it is just difficult for a linear model to simulate this kind of trend. In that case, more advanced algorithms would be helpful.
set.seed(49)
vars_train %>%
filter(day == 15) %>%
filter(station_id == sample(station_id, 4)[1]|
station_id == sample(station_id, 4)[2]|
station_id == sample(station_id, 4)[3]|
station_id == sample(station_id, 4)[4]) %>%
mutate(hour = as.numeric(hour)) %>%
ggplot(.) +
geom_line(aes(x = hour, y = departure, color = station)) +
scale_x_continuous(breaks = seq(6, 23, by = 1)) +
labs(title = "Weekday Observed Departure Comparison",
subtitle = 'San Francisco, CA; Randomly Selected Stations',
y = 'Observed Departure') +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"))
One interesting fact we found about our models is that they tend to under-predict in weekday peak hours. Shown in the line plots below, the differences between observed and predicted are marginal except for evening peaks. This indicates that we need to consider how to amplify the impact of weekday peak hours in the next steps.
glm1.1PredValues %>%
filter(station_name == 'S Park St at 3rd St') %>%
dplyr::select(hour, observedDeparture, predictedDeparture) %>%
gather(type, value, observedDeparture:predictedDeparture) %>%
ggplot(.) +
geom_line(aes(x = hour, y = value, color = type)) +
scale_x_continuous(breaks = seq(6, 23, by = 1)) +
scale_y_continuous(breaks = seq(0, 14, by = 2)) +
scale_color_manual(labels = c("Observed", "Predicted"), values = c("skyblue3", "tomato3")) +
labs(title = "Observed vs. Predicted Departure",
subtitle = 'S Park St at 3rd St Station; Weekday Testset',
y = 'Count') +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.title=element_blank())
glm1.1PredValues %>%
filter(station_name == '8th St at Brannan St') %>%
dplyr::select(hour, observedDeparture, predictedDeparture) %>%
gather(type, value, observedDeparture:predictedDeparture) %>%
ggplot(.) +
geom_line(aes(x = hour, y = value, color = type)) +
scale_x_continuous(breaks = seq(6, 23, by = 1)) +
scale_y_continuous(breaks = seq(0, 14, by = 2)) +
scale_color_manual(labels = c("Observed", "Predicted"), values = c("skyblue3", "tomato3")) +
labs(title = "Observed vs. Predicted Departure",
subtitle = '8th St at Brannan St Station; Weekday Testset ',
y = 'Count') +
theme_minimal() +
theme(plot.title=element_text(size=17, face="bold", vjust=-1),
plot.subtitle=element_text(size=12, face="italic", color="grey"),
legend.title=element_blank())
Currently we are predicting departure and arrival separately. A more straightforward approach would be to model the difference between arrival and departure. If arrival - departure > 0, it means that the particular station will gain a net increase in the number of available bikes in a certain hour; otherwise it will gain a net increase in the number of available docks.
Below shows the codes developed to process the data.
##----
## Preparation
bike <- read.csv('201810-fordgobike-tripdata.csv',
as.is = TRUE) %>%
filter(start_station_id != 'NULL')
sf_unique <-
bike %>%
group_by(start_station_id) %>%
summarize(count = n(),
start_station_name = first(start_station_name),
lat = first(start_station_latitude),
lon = first(start_station_longitude))
sf_unique <-
sf_unique %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326, agr = "constant") %>%
st_sf() %>%
st_transform(crs = 102641)
sf_boundary <-
read_sf('sf_boundary.shp') %>%
st_transform(crs = 102641) %>%
mutate(area = st_area(.)) %>%
mutate(area = as.numeric(area))
sf_bike <-
st_intersection(sf_boundary, sf_unique) %>%
select(start_station_id, start_station_name)
nn_function <- function(measureFrom,measureTo,k) {
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint)
return(output)
}
##----
## Dependent Variables
bike <-
bike %>%
mutate(month = substr(start_time, 6, 7)) %>%
mutate(day = substr(start_time, 9, 10)) %>%
mutate(hour = substr(start_time, 12, 13))
n <- length(unique(sf_bike$start_station_name)) * 1 * 31 * 24
empty <-
data.frame(station = rep(unique(sf_bike$start_station_name),
times = n/length(unique(sf_bike$start_station_name))),
month = rep('10', n),
day = rep(c('01', '02', '03', '04', '05', '06', '07',
'08', '09', '10', '11', '12', '13', '14',
'15', '16', '17', '18', '19', '20', '21',
'22', '23', '24', '25', '26', '27', '28',
'29', '30', '31'),
each = n/31),
hour = rep(c('00', '01', '02', '03', '04', '05', '06', '07',
'08', '09', '10', '11', '12', '13', '14', '15',
'16', '17', '18', '19', '20', '21', '22', '23'),
each = length(unique(sf_bike$start_station_name))))
bike_arrival <-
bike %>%
group_by(end_station_id, end_station_name, month, day, hour) %>%
summarize(arrival = n()) %>%
left_join(empty, ., by = c('station' = 'end_station_name',
'month' = 'month',
'day' = 'day',
'hour' = 'hour')) %>%
left_join(., sf_unique, by = c('station' = 'start_station_name')) %>%
mutate(station_id = start_station_id) %>%
select(-start_station_id, -end_station_id, -count) %>%
mutate(arrival = replace_na(arrival, 0)) %>%
arrange(., station_id, day, hour) %>%
mutate(identifier = rep(c(1:744), length(unique(sf_bike$start_station_name))))
bike_departure <-
bike %>%
group_by(start_station_id, start_station_name, month, day, hour) %>%
summarize(departure = n()) %>%
left_join(empty, ., by = c('station' = 'start_station_name',
'month' = 'month',
'day' = 'day',
'hour' = 'hour')) %>%
left_join(., sf_unique, by = c('station' = 'start_station_name')) %>%
mutate(station_id = start_station_id.y) %>%
select(-start_station_id.x, -start_station_id.y, -count) %>%
mutate(departure = replace_na(departure, 0))%>%
arrange(., station_id, day, hour) %>%
mutate(identifier = rep(c(1:744), length(unique(sf_bike$start_station_name))))
##----
## Loading Independent Variables
bus <-
read.socrata('https://data.sfgov.org/resource/ikg8-jrfu.json') %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant") %>%
st_sf() %>%
st_transform(crs = 102641) %>%
select() %>%
mutate(Legend = "Bus_Stop")
park <-
read.socrata('https://data.sfgov.org/Culture-and-Recreation/Recreation-and-Parks-Facilities/xvq2-rjrk') %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant") %>%
st_sf() %>%
st_transform(crs = 102641) %>%
select() %>%
mutate(Legend = "Park")
crash <-
read.csv('Collisions.csv') %>%
filter(COUNT_PED_INJURED > 0 | COUNT_BICYCLIST_INJURED > 0 |
COUNT_PED_KILLED > 0 | COUNT_BICYCLIST_KILLED > 0) %>%
st_as_sf(coords = c("POINT_X", "POINT_Y"), crs = 4326, agr = "constant") %>%
st_sf() %>%
st_transform(crs = 102641) %>%
select() %>%
mutate(Legend = "Crash")
## Distance to bus stops, parks, and traffic accidents
sf_bike.xy <-
sf_bike %>%
cbind(.,st_coordinates(st_centroid(sf_bike))) %>%
st_set_geometry(NULL) %>%
dplyr::select(X,Y) %>%
as.matrix()
bus.xy <-
bus %>%
cbind(.,st_coordinates(st_centroid(bus))) %>%
st_set_geometry(NULL) %>%
dplyr::select(X,Y) %>%
as.matrix()
park.xy <-
park %>%
cbind(.,st_coordinates(st_centroid(park))) %>%
st_set_geometry(NULL) %>%
dplyr::select(X,Y) %>%
as.matrix()
crash.xy <-
crash %>%
cbind(.,st_coordinates(st_centroid(crash))) %>%
st_set_geometry(NULL) %>%
dplyr::select(X,Y) %>%
as.matrix()
dist_Park1 <-
as.data.frame(nn_function(sf_bike.xy, park.xy, 1)) %>%
mutate(dist_Park1 = log(pointDistance)) %>%
select(dist_Park1)
dist_Bus1 <-
as.data.frame(nn_function(sf_bike.xy, bus.xy, 1)) %>%
mutate(log_dist_Bus1 = log(pointDistance)) %>%
select(log_dist_Bus1)
dist_Bus3 <-
as.data.frame(nn_function(sf_bike.xy, bus.xy, 3)) %>%
mutate(log_dist_Bus3 = log(pointDistance)) %>%
select(log_dist_Bus3)
dist_Crash1 <-
as.data.frame(nn_function(sf_bike.xy, crash.xy, 3)) %>%
mutate(log_dist_Crash1 = log(pointDistance)) %>%
select(log_dist_Crash1)
dist_variables <-
cbind(dist_Park1, dist_Bus1, dist_Bus3, dist_Crash1, unique(sf_bike_WGS$start_station_id))
#----
## Neighborhood
neighborhood <-
read_sf('neighborhood.shp') %>%
st_transform(crs = 102641)
sf_bike <-
st_intersection(neighborhood, sf_bike)
#----
## Block Group
sf_bike <-
st_intersection(sf_boundary %>%
select(GEOID, area), sf_bike) %>%
mutate(block_area = area) %>%
select(-area)
#----
## Census & Jobs
job <-
read.csv('jobs.csv') %>%
select(w_geocode, S000) %>%
mutate(w_geocode = as.numeric(substr(w_geocode, 1, 11))) %>%
group_by(w_geocode) %>%
summarize(jobs = sum(S000))
census <-
read.csv('census.csv',
as.is = TRUE)
census <-
left_join(census, job, by = c('GeoID' = 'w_geocode')) %>%
mutate(jobs = replace_na(jobs, 0))
sf_bike <-
sf_bike %>%
mutate(GEOID = as.numeric(GEOID))
#----
## Time
bike_departure <-
bike_departure %>%
mutate(whichday = ifelse(day == '01'|day == '08'|
day == '15'|day == '22'|
day == '29', 'Mon',
ifelse(day == '02'|day == '09'|
day == '16'|day == '23'|
day == '30', 'Tue',
ifelse(day == '03'|day == '10'|
day == '17'|day == '24'|
day == '31', 'Wed',
ifelse(day == '04'|day == '11'|
day == '18'|day == '25', 'Thur',
ifelse(day == '05'|day == '12'|
day == '19'|day == '26', 'Fri',
ifelse(day == '06'|day == '13'|
day == '20'|day == '27', 'Sat', 'Sun')))))))
#----
## Distance to high-volume stations
high_station <-
sf_bike %>%
filter(start_station_id == '67' | start_station_id == '58' | start_station_id == '15')
dist_highvolume <-
st_distance(sf_bike, high_station) %>%
as.data.frame() %>%
cbind(., sf_bike$start_station_id) %>%
mutate(dist_ferryBuilding = as.numeric(V1),
dist_Market = as.numeric(V2),
dist_Townsend = as.numeric(V3),
station_id = sf_bike$start_station_id) %>%
select(station_id, dist_Townsend, dist_ferryBuilding)
#----
## Land Use
landUse <-
read_sf('landuse.shp') %>%
st_transform(crs = 102641) %>%
select(landuse)
bike_buffer_1mile <-
st_buffer(sf_bike, 5280) %>%
mutate(area = as.numeric(st_area(.)))
landuse_stat <-
st_intersection(landUse, bike_buffer_hfmile) %>%
mutate(area = as.numeric(st_area(.))) %>%
filter(landuse != 'MISSING DATA') %>%
filter(landuse != 'Right of Way') %>%
mutate(landuse = ifelse(landuse == 'RETAIL/ENT', 'RETAIL', landuse)) %>%
group_by(start_station_id, landuse) %>%
summarize(sum_area = sum(area)) %>%
st_set_geometry(NULL) %>%
spread(landuse, sum_area, fill = 1) %>%
mutate(area = CIE + MED + MIPS + MIXED + MIXRES + OpenSpace + PDR + RESIDENT + RETAIL + VACANT + VISITOR) %>%
mutate(pct_CIE = CIE/area,
pct_MED = MED/area,
pct_MIPS = MIPS/area,
pct_MIXED = MIXED/area,
pct_MIXRES = MIXRES/area,
pct_OpenSpace = OpenSpace/area,
pct_PDR = PDR/area,
pct_RESIDENT = RESIDENT/area,
pct_RETAIL = RETAIL/area,
pct_VACANT = VACANT/area,
pct_VISITOR = VISITOR/area) %>%
mutate(entropy = pct_CIE*(-log(pct_CIE)/log(11))+
pct_MED*(-log(pct_MED)/log(11))+
pct_MIPS*(-log(pct_MIPS)/log(11))+
pct_MIXED*(-log(pct_MIXED)/log(11))+
pct_MIXRES*(-log(pct_MIXRES)/log(11))+
pct_OpenSpace*(-log(pct_OpenSpace)/log(11))+
pct_PDR*(-log(pct_PDR)/log(11))+
pct_RESIDENT*(-log(pct_RESIDENT)/log(11))+
pct_RETAIL*(-log(pct_RETAIL)/log(11))+
pct_VACANT*(-log(pct_VACANT)/log(11))+
pct_VISITOR*(-log(pct_VISITOR)/log(11)))
#----
## Past Departure/Arrival
bike_departure_timeLag <-
bike_departure %>%
mutate(identifier_1h = identifier + 1,
identifier_2h = identifier + 2,
identifier_3h = identifier + 3,
identifier_24h = identifier + 24,
identifier_1w = identifier + 168) %>%
select(station_id, departure, identifier, identifier_1h,
identifier_2h, identifier_3h, identifier_24h, identifier_1w)
bike_arrival_timelag <-
bike_arrival %>%
mutate(identifier_1h = identifier + 1,
identifier_2h = identifier + 2,
identifier_3h = identifier + 3,
identifier_24h = identifier + 24,
identifier_1w = identifier + 168,
identifier_1w1h = identifier + 169) %>%
select(station_id, arrival, identifier, identifier_1h, identifier_2h,
identifier_3h, identifier_24h, identifier_1w, identifier_1w1h)
bike_departure <-
left_join(bike_departure, bike_departure_timeLag,
by = c('identifier' = 'identifier_1h',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, departure.x,
weekday, peak, whichday, departure.y, identifier) %>%
rename(departure = departure.x,
departure_1h = departure.y) %>%
left_join(., bike_departure_timeLag,
by = c('identifier' = 'identifier_2h',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, departure.x,
weekday, peak, whichday, departure_1h, departure.y, identifier) %>%
rename(departure = departure.x,
departure_2h = departure.y) %>%
left_join(., bike_departure_timeLag,
by = c('identifier' = 'identifier_3h',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, departure.x, weekday,
peak, whichday, departure_1h, departure_2h, departure.y, identifier) %>%
rename(departure = departure.x,
departure_3h = departure.y) %>%
left_join(., bike_departure_timeLag,
by = c('identifier' = 'identifier_24h',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, departure.x, weekday, peak, whichday,
departure_1h, departure_2h, departure_3h, departure.y, identifier) %>%
rename(departure = departure.x,
departure_24h = departure.y) %>%
left_join(., bike_departure_timeLag,
by = c('identifier' = 'identifier_1w',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, departure.x, weekday, peak, whichday,
departure_1h, departure_2h, departure_3h, departure_24h, departure.y, identifier) %>%
rename(departure = departure.x,
departure_1w = departure.y) %>%
left_join(., bike_arrival_timelag,
by = c('identifier' = 'identifier',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, weekday, peak, whichday, departure, departure_1h,
departure_2h, departure_3h, departure_24h, departure_1w, arrival, identifier) %>%
left_join(., bike_arrival_timelag,
by = c('identifier' = 'identifier_1h',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, weekday, peak, whichday, departure, departure_1h,
departure_2h, departure_3h, departure_24h, departure_1w, arrival.x, arrival.y, identifier) %>%
rename(arrival = arrival.x,
arrival_1h = arrival.y) %>%
left_join(., bike_arrival_timelag,
by = c('identifier' = 'identifier_2h',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, weekday, peak, whichday, departure, departure_1h,
departure_2h, departure_3h, departure_24h, departure_1w, arrival.x, arrival_1h, arrival.y, identifier) %>%
rename(arrival = arrival.x,
arrival_2h = arrival.y) %>%
left_join(., bike_arrival_timelag,
by = c('identifier' = 'identifier_3h',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, weekday, peak, whichday, departure, departure_1h,
departure_2h, departure_3h, departure_24h, departure_1w, arrival.x, arrival_1h, arrival_2h, arrival.y, identifier) %>%
rename(arrival = arrival.x,
arrival_3h = arrival.y) %>%
left_join(., bike_arrival_timelag,
by = c('identifier' = 'identifier_24h',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, weekday, peak, whichday, departure, departure_1h,
departure_2h, departure_3h, departure_24h, departure_1w, arrival.x, arrival_1h, arrival_2h,
arrival_3h, arrival.y, identifier) %>%
rename(arrival = arrival.x,
arrival_24h = arrival.y) %>%
left_join(., bike_arrival_timelag,
by = c('identifier' = 'identifier_1w',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, weekday, peak, whichday, departure, departure_1h,
departure_2h, departure_3h, departure_24h, departure_1w, arrival.x, arrival_1h, arrival_2h,
arrival_3h, arrival_24h, arrival.y, identifier) %>%
rename(arrival = arrival.x,
arrival_1w = arrival.y) %>%
left_join(., bike_arrival_timelag,
by = c('identifier' = 'identifier_1w1h',
'station_id' = 'station_id')) %>%
select(station_id, station, month, day, hour, weekday, peak, whichday, departure, departure_1h,
departure_2h, departure_3h, departure_24h, departure_1w, arrival.x, arrival_1h, arrival_2h,
arrival_3h, arrival_24h, arrival_1w, arrival.y, identifier) %>%
rename(arrival = arrival.x,
arrival_1w1h = arrival.y)
#----
## Nearby Departure/Arrival
sf_bike <-
sf_bike %>%
mutate(fakeID = as.numeric(rownames(.)))
nearest_table <-
get.knnx(sf_bike.xy, sf_bike.xy, 6)$`nn.index` %>%
as.data.frame() %>%
rename(fakeID = V1,
nearest1 = V2,
nearest2 = V3,
nearest3 = V4,
nearest4 = V5,
nearest5 = V6) %>%
left_join(.,
sf_bike %>%
st_set_geometry(NULL) %>%
select(fakeID, start_station_id),
by = c('nearest1' = 'fakeID')) %>%
rename(nearest_1 = start_station_id) %>%
left_join(.,
sf_bike %>%
st_set_geometry(NULL) %>%
select(fakeID, start_station_id),
by = c('nearest2' = 'fakeID')) %>%
rename(nearest_2 = start_station_id) %>%
left_join(.,
sf_bike %>%
st_set_geometry(NULL) %>%
select(fakeID, start_station_id),
by = c('nearest3' = 'fakeID')) %>%
rename(nearest_3 = start_station_id) %>%
left_join(.,
sf_bike %>%
st_set_geometry(NULL) %>%
select(fakeID, start_station_id),
by = c('nearest4' = 'fakeID')) %>%
rename(nearest_4 = start_station_id) %>%
left_join(.,
sf_bike %>%
st_set_geometry(NULL) %>%
select(fakeID, start_station_id),
by = c('nearest5' = 'fakeID')) %>%
rename(nearest_5 = start_station_id) %>%
cbind(sf_bike$start_station_id, .) %>%
mutate(station_id = sf_bike$start_station_id) %>%
select(nearest_1:station_id)
bike_departure_full <-
left_join(bike_departure, nearest_table, by = c('station_id')) %>%
left_join(.,
bike_departure %>%
select(station_id, day, hour, departure, departure_1h, departure_2h,
departure_3h, departure_24h, departure_1w, arrival, arrival_1h,
arrival_2h, arrival_3h, arrival_24h, arrival_1w, arrival_1w1h),
by = c('nearest_1' = 'station_id',
'day' = 'day',
'hour' = 'hour')) %>%
left_join(.,
bike_departure %>%
select(station_id, day, hour, departure, departure_1h, departure_2h,
departure_3h, departure_24h, departure_1w, arrival, arrival_1h,
arrival_2h, arrival_3h, arrival_24h, arrival_1w, arrival_1w1h),
by = c('nearest_2' = 'station_id',
'day' = 'day',
'hour' = 'hour')) %>%
left_join(.,
bike_departure %>%
select(station_id, day, hour, departure, departure_1h, departure_2h,
departure_3h, departure_24h, departure_1w, arrival, arrival_1h,
arrival_2h, arrival_3h, arrival_24h, arrival_1w, arrival_1w1h),
by = c('nearest_3' = 'station_id',
'day' = 'day',
'hour' = 'hour')) %>%
left_join(.,
bike_departure %>%
select(station_id, day, hour, departure, departure_1h, departure_2h,
departure_3h, departure_24h, departure_1w, arrival, arrival_1h,
arrival_2h, arrival_3h, arrival_24h, arrival_1w, arrival_1w1h),
by = c('nearest_4' = 'station_id',
'day' = 'day',
'hour' = 'hour')) %>%
left_join(.,
bike_departure %>%
select(station_id, day, hour, departure, departure_1h, departure_2h,
departure_3h, departure_24h, departure_1w, arrival, arrival_1h,
arrival_2h, arrival_3h, arrival_24h, arrival_1w, arrival_1w1h),
by = c('nearest_5' = 'station_id',
'day' = 'day',
'hour' = 'hour'))
names(bike_departure_full) <-
gsub('.x.x.x', '_n4', names(bike_departure_full), fixed = TRUE)
names(bike_departure_full) <-
gsub('.y.y.y', '_n5', names(bike_departure_full), fixed = TRUE)
names(bike_departure_full) <-
gsub('.x.x', '_n2', names(bike_departure_full), fixed = TRUE)
names(bike_departure_full) <-
gsub('.y.y', '_n3', names(bike_departure_full), fixed = TRUE)
names(bike_departure_full) <-
gsub('.y', '_n1', names(bike_departure_full), fixed = TRUE)
names(bike_departure_full) <-
gsub('.x', '', names(bike_departure_full), fixed = TRUE)
bike_departure_full <-
bike_departure_full %>%
mutate(nearest3_departure = (departure_n1 + departure_n2 + departure_n3) / 3,
nearest3_departure_1h = (departure_1h_n1 + departure_1h_n2 + departure_1h_n3) / 3,
nearest3_departure_2h = (departure_2h_n1 + departure_2h_n2 + departure_2h_n3) / 3,
nearest3_departure_3h = (departure_3h_n1 + departure_3h_n2 + departure_3h_n3) / 3,
nearest3_departure_24h = (departure_24h_n1 + departure_24h_n2 + departure_24h_n3) / 3,
nearest3_departure_1w = (departure_1w_n1 + departure_1w_n2 + departure_1w_n3) / 3,
nearest3_arrival = (arrival_n1 + arrival_n2 + arrival_n3) / 3,
nearest3_arrival_1h = (arrival_1h_n1 + arrival_1h_n2 + arrival_1h_n3) / 3,
nearest3_arrival_2h = (arrival_2h_n1 + arrival_2h_n2 + arrival_2h_n3) / 3,
nearest3_arrival_3h = (arrival_3h_n1 + arrival_3h_n2 + arrival_3h_n3) / 3,
nearest3_arrival_24h = (arrival_24h_n1 + arrival_24h_n2 + arrival_24h_n3) / 3,
nearest3_arrival_1w = (arrival_1w_n1 + arrival_1w_n2 + arrival_1w_n3) / 3,
nearest3_arrival_1w1h = (arrival_1w1h_n1 + arrival_1w1h_n2 + arrival_1w1h_n3) / 3,
nearest5_departure = (departure_n1 + departure_n2 + departure_n3 + departure_n4 + departure_n5) / 5,
nearest5_departure_1h = (departure_1h_n1 + departure_1h_n2 + departure_1h_n3 + departure_1h_n4 + departure_1h_n5) / 5,
nearest5_departure_2h = (departure_2h_n1 + departure_2h_n2 + departure_2h_n3 + departure_2h_n4 + departure_2h_n5) / 5,
nearest5_departure_3h = (departure_3h_n1 + departure_3h_n2 + departure_3h_n3 + departure_3h_n4 + departure_3h_n5) / 5,
nearest5_departure_24h = (departure_24h_n1 + departure_24h_n2 + departure_24h_n3 + departure_24h_n4 + departure_24h_n5) / 5,
nearest5_departure_1w = (departure_1w_n1 + departure_1w_n2 + departure_1w_n3 + departure_1w_n4 + departure_1w_n5) / 5,
nearest5_arrival = (arrival_n1 + arrival_n2 + arrival_n3 + arrival_n4 + arrival_n5) / 5,
nearest5_arrival_1h = (arrival_1h_n1 + arrival_1h_n2 + arrival_1h_n3 + arrival_1h_n4 + arrival_1h_n5) / 5,
nearest5_arrival_2h = (arrival_2h_n1 + arrival_2h_n2 + arrival_2h_n3 + arrival_2h_n4 + arrival_2h_n5) / 5,
nearest5_arrival_3h = (arrival_3h_n1 + arrival_3h_n2 + arrival_3h_n3 + arrival_3h_n4 + arrival_3h_n5) / 5,
nearest5_arrival_24h = (arrival_24h_n1 + arrival_24h_n2 + arrival_24h_n3 + arrival_24h_n4 + arrival_24h_n5) / 5,
nearest5_arrival_1w = (arrival_1w_n1 + arrival_1w_n2 + arrival_1w_n3 + arrival_1w_n4 + arrival_1w_n5) / 5,
nearest5_arrival_1w1h = (arrival_1w1h_n1 + arrival_1w1h_n2 + arrival_1w1h_n3 + arrival_1w1h_n4 + arrival_1w1h_n5) / 5) %>%
select(station_id:nearest_5, nearest3_departure:nearest5_arrival_1w1h)
#----
## Weather
weather <-
read.csv('SF_climate_hourly.csv',
as.is = TRUE)
#----
## Integrating into one dataset
vars <-
left_join(dist_variables, dist_landmark, by = c('start_station_id' = 'start_station_id')) %>%
left_join(., landUse_summary, by = c('start_station_id' = 'station_id')) %>%
left_join(., dist_highvolume, by = c('start_station_id' = 'station_id')) %>%
left_join(sf_bike %>%
dplyr::select(start_station_id, geoid10, GEOID, nhood, block_area, taz),
.,
by = c('start_station_id' = 'start_station_id')) %>%
left_join(., census, by = c('GEOID.x' = 'GeoID')) %>%
left_join(bike_departure_full, ., by = c('station_id' = 'start_station_id')) %>%
left_join(., weather, by = c('day' = 'Date',
'hour' = "HOUR")) %>%
mutate(Wind.Speed = as.numeric(substr(Wind.Speed, 1, 1))) %>%
mutate(eastMost = ifelse(nhood == 'Financial District/South Beach' | nhood == 'Mission Bay' | nhood == 'South of Market', 'in', 'out')) %>%
mutate(weekday_hot = ifelse(station == 'San Francisco Caltrain Station 2 (Townsend St at 4th St)' |
station == 'San Francisco Ferry Building (Harry Bridges Plaza)' |
station == 'San Francisco Caltrain (Townsend St at 4th St)' |
station == 'Montgomery St BART Station (Market St at 2nd St)' |
station == 'Berry St at 4th St' |
station == 'Powell St BART Station (Market St at 4th St)' |
station == 'Steuart St at Market St' |
station == 'The Embarcadero at Sansome St' |
station == 'Howard St at Beale St', 1, 0)) %>%
mutate(weekend_hot = ifelse(weekday == 'weekday', 0, weekend_hot)) %>%
filter(hour != '0' & hour != '1' & hour != '2' & hour != '3' & hour != '4' & hour != '5')
#----
## Training and Test Set
vars_train <-
vars %>%
filter(day >= 14 & day <= 20,
station_id != '80' & station_id != '344')
vars_test1 <-
vars %>%
filter(day == 22,
station_id != '80' & station_id != '344')
vars_test2 <-
vars %>%
filter(day == 21,
station_id != '80' & station_id != '344')